## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
Mean quality is 5.636. Mean alcohol is 10.42. Median total.sulfur.dioxide is 38 while the max is 289, which suggests an outlier. 1599 observations and 13 variables.
The data was transformed with scale_x_continuous and binwidth of 0.1. Most fixed acidity ranges from x = 7-8.
Prior to transformation we notice an outlier of x = 1.6. After changing the binwidth to 0.005 and limiting the x-axis, the graph shows a very loose bimodal peak. I wonder how this graph relates to fixed.acitiy.
The original graph shows an outlier at x = 1. I changed the binwidth to .001 to better see the data. There are 2 x-values where the count is the highest, x = 0 and x = 0.49.
Majority of residual sugar is less than x = 4. I transformed the y-axis by sqrt and binwidth to .05 to better see the bars. Data does not appear in some ranges such as x = 10, 12, 14, I wonder why. I limited the data to get a closer look at the peak. Most of the data falls in the range of x = (1.5:2.5) with x = 2 being the most.
An overwhelming amout of chlorides are below 0.2. Changing the binwidth to .0005 and limiting the X-axis between (0, .15) we see that the range for chlorides is 0.05:0.1 with a normal distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
The graph is skewed to the right so I transformed it using log10(). This shows the value of free.sulfur.dioxide is prominent between 5-15. The max count is where x = 6 and the median x-value is 14.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
I log transformed the x-axis by log10 and binwidth to .001 since it was also scewed to the right. Compared to free sulfur dioxide, there is a better range of data. We have a max value of 289 but median of 38. The outliers are probably affecting the data.
Normal distribution. I transformed the graph by changing the binwidth to .000005 and limiting the x-axis to (.995:1). The density values are very different from one another since the max count is about 37.
Normal distribution. Changed the binwidth to .001 to see the bars better. The prime pH range is 3:3.5. This makes sense since the average for red wine is between 3.3 to 3.6.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
We have outliers around x = 2. This graph is skewed to the the right so I transformed x-axis by log10() and binwidth to .0005. The values are very small and are prominent from 0.5:0.6, the Mean is x = 0.658 and median = 0.62. We can see the outliers don’t affect the data as much.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Created a new variable (alcohol_content) with low, medium, high alcohol content. There is a strict value (x=9.5%) where alcohol content is optimally chosen for wine. The count gradually decreases as content increase and immediately drops as content decreases below 9.5%. The average alcohol content is 10.42%.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
The scale ranges from 0:10 with 0 being the worst quality and 10 being the best. There are no quality values for 0,1,2,9, and 10. We can also observe a normal distribution.
1599 observations and 13 variables (x, fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfu.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality.)
Other observations: Quality range is 0:10 and mean is 5.6 Max pH is 4 mean and median alcohol is about 10%
The quality of the wine.
The alcohol content to determine the boldness, the pH to determine acidity/tartness, and residual sugar to determine sweetness. They will all affect the quality of the wine.
Yes, I created alcohol_content which consists of 3 buckets (0, 9.5], (9.5, 11.1], and (11.1, 15]. Then I renamed each bucket to Low, Medium, and High to differentiate the amount of alcohol catagorically.
Most graphs where either normally distributed or skewed right. Citric acid was the only graph that was a little different since there was a sharp decrease from 0:0.25, then another surge at 0.50.
## X fixed.acidity volatile.acidity
## X 1.000000000 0.00000000 0.724669575
## fixed.acidity -0.268483920 1.00000000 0.000000000
## volatile.acidity -0.008815099 -0.25613089 1.000000000
## citric.acid -0.153551355 0.67170343 -0.552495685
## residual.sugar -0.031260835 0.11477672 0.001917882
## chlorides -0.119868519 0.09370519 0.061297772
## free.sulfur.dioxide 0.090479643 -0.15379419 -0.010503827
## total.sulfur.dioxide -0.117849669 -0.11318144 0.076470005
## density -0.368372087 0.66804729 0.022026232
## pH 0.136005328 -0.68297819 0.234937294
## sulphates -0.125306999 0.18300566 -0.260986685
## alcohol 0.245122841 -0.06166827 -0.202288027
## quality 0.066452608 0.12405165 -0.390557780
## citric.acid residual.sugar chlorides
## X 6.744481e-10 2.115297e-01 1.533390e-06
## fixed.acidity 0.000000e+00 4.199465e-06 1.751746e-04
## volatile.acidity 0.000000e+00 9.389168e-01 1.422491e-02
## citric.acid 1.000000e+00 8.083723e-09 2.220446e-16
## residual.sugar 1.435772e-01 1.000000e+00 2.617079e-02
## chlorides 2.038229e-01 5.560954e-02 1.000000e+00
## free.sulfur.dioxide -6.097813e-02 1.870490e-01 5.562147e-03
## total.sulfur.dioxide 3.553302e-02 2.030279e-01 4.740047e-02
## density 3.649472e-01 3.552834e-01 2.006323e-01
## pH -5.419041e-01 -8.565242e-02 -2.650261e-01
## sulphates 3.127700e-01 5.527121e-03 3.712605e-01
## alcohol 1.099032e-01 4.207544e-02 -2.211405e-01
## quality 2.263725e-01 1.373164e-02 -1.289066e-01
## free.sulfur.dioxide total.sulfur.dioxide
## X 2.915917e-04 2.297726e-06
## fixed.acidity 6.335579e-10 5.709033e-06
## volatile.acidity 6.747011e-01 2.213857e-03
## citric.acid 1.473916e-02 1.555454e-01
## residual.sugar 4.685141e-14 2.220446e-16
## chlorides 8.241238e-01 5.809120e-02
## free.sulfur.dioxide 1.000000e+00 0.000000e+00
## total.sulfur.dioxide 6.676665e-01 1.000000e+00
## density -2.194583e-02 7.126948e-02
## pH 7.037750e-02 -6.649456e-02
## sulphates 5.165757e-02 4.294684e-02
## alcohol -6.940835e-02 -2.056539e-01
## quality -5.065606e-02 -1.851003e-01
## density pH sulphates alcohol
## X 0.000000e+00 4.770847e-08 4.992031e-07 0.000000e+00
## fixed.acidity 0.000000e+00 0.000000e+00 1.648681e-13 1.364868e-02
## volatile.acidity 3.787554e-01 0.000000e+00 0.000000e+00 3.330669e-16
## citric.acid 0.000000e+00 0.000000e+00 0.000000e+00 1.059462e-05
## residual.sugar 0.000000e+00 6.065915e-04 8.252134e-01 9.258425e-02
## chlorides 5.551115e-16 0.000000e+00 0.000000e+00 0.000000e+00
## free.sulfur.dioxide 3.804985e-01 4.869975e-03 3.888321e-02 5.492314e-03
## total.sulfur.dioxide 4.354284e-03 7.818341e-03 8.601835e-02 1.110223e-16
## density 1.000000e+00 0.000000e+00 2.418474e-09 0.000000e+00
## pH -3.416993e-01 1.000000e+00 2.109424e-15 1.110223e-16
## sulphates 1.485064e-01 -1.966476e-01 1.000000e+00 1.783053e-04
## alcohol -4.961798e-01 2.056325e-01 9.359475e-02 1.000000e+00
## quality -1.749192e-01 -5.773139e-02 2.513971e-01 4.761663e-01
## quality
## X 7.857465e-03
## fixed.acidity 6.495635e-07
## volatile.acidity 0.000000e+00
## citric.acid 0.000000e+00
## residual.sugar 5.832180e-01
## chlorides 2.313383e-07
## free.sulfur.dioxide 4.283398e-02
## total.sulfur.dioxide 8.615331e-14
## density 1.874945e-12
## pH 2.096278e-02
## sulphates 0.000000e+00
## alcohol 0.000000e+00
## quality 1.000000e+00
https://stat.ethz.ch/pipermail/r-help/2001-November/016201.html
I found the cor.prob function online (listed above) that allows me to compare the correlation for each variable combination in the dataset. The best correlation for quality is residual sugar. I also imported the data as data_original to exclude the alcohol_content column, since I am not able to take the correlation of non integers.
F_Acidity(fixed.acidity), V_Acidity(volatile.acidity), F_Sulfur(free.sulfur.dioxide), T_Sulfur(total.sulfur.dioxide)
Wanted to look closer at the relationship of alcohol content and quality, but it’s difficult to draw any conclusions besides, low quality generally has low alcohol content. This might be becuase there was no high alcohol, low quality wine sampled. Majority of the quality is 5, 6, or 7.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
We can see some horizontal ‘lines’ which suggest the alcohol content is staying the same but there is a range of pH between 3-3.5. The mean pH is 3.311.
There is somewhat of a direct relationship shown in the first graph. After the transformation of the y-scale by sqrt(), it shows a pool of data points for y = 0. There are many y-value horizontal lines as well leading up to y = 0.25.
There are some very high values for residual sugar, 12-16 range, (likely more of the sweeter wines). No strict relationship can be determined.
There is a clear direct relationship.
Higher alcohol content wines contain greater average sulphates.
Most quantity of alcohol type are as follows: (alcohol_content, quality) –> (Low, 5), (Medium, 5 & 6), (High, 6 & 7).
Refering to alcohol vs quality graph: Higher quality (8) red wine has higher alcohol content while low quality (3 and 4) has lower content. Quality of 5, 6, and 7 have a bigger range but you can see a general shift. It’s hard to determine if there is a relationship though, since there isn’t enough data points for low and high quality wines compared to average quality.
Refering to alcohol vs pH: We can see multiple horizontal “lines” which suggests varying pH for the same alcohol content. This is probably becuase companies choose similar alcohol amounts but vary in tartness and crispness.
Refering to residual sugar vs fixed acidity: We can see pooling of data points below sugar of 4. There is no real relationship here since sugar ranges the same as acidity increases. The outlier data ranging from y = 12-16 is interesting but does not show any relationship to acidity.
Referring to quality vs alcohol_content: With the amount of data we have, there is a relationship between alcohol content and quality. Generally the higher content is of better quality since there are more data points. This might change if we introduce further data.
As pH decreases fixed acidity increases, obviously this makes sense since that is the definition of acidic solution.
I compared fixed acidity vs density and found a strong direct relationship. Quality is also affected by density and acidity as higher quality wines are more acidic and have less density.
The fixed acidity vs density graph.
This graph makes sense since higher acidic wines will naturally have a lower pH and vice versa. Coloring the graph by quality does not show any strong relationships.
After reviewing the correlation graph I decided to look into density and acidity. There seems to be a strong direct relationship where density increases as acidity increases. Adding quality by color shows lower quality (4 and 5) are high density and low acidity compared to (6, 7, 8).
Comparing the residual sugar and alcohol shows most of the red wine is below 4 which means they would be off-dry wines and range from 9-13% in alcohol content. When differentiating by quality it can be seen that higher quality wines have higher alcohol content (>10%) and generally less sweet.
Each graph is colored by alcohol_content and seperated by quality. We see medium and high alcohol content more prominent in higher quality (7, 8). There is also a small shift down on the y-axis (decrease in volatile acidity) as the quality increases.
Density of wine is closer to water (1.000) when alcohol content is lower and residual sugar ranges from 2 to 4.
There is a relationship between the amount of alcohol and the quality of the wine. Generally higher quality wines will be of higher alcohol content. We can also see a shift towards less volatile acidity. This makes sense since higher levels of acetic acid will cause an unpleasant taste.
I did not find anything too surprising but after facet wrapping the graph by quality it was easier to see the differences in quality and alcohol content. I was surprised to see how much they related since I was not able to see the relationship clearly in my previous graphs.
This graph shows the greatest alcohol % to be around 9.5% and we see a gradual decrease as the % increases or decreases. This is probably becuase that’s the optimal alcohol % that is time efficient (as in the amount of time needed for fermentation) for wholesale, or the amount most consumers prefer.
The top graph compares the alcohol % by quality of wine to better show which quality has the most data points (5,6,7). The bottom graph shows how acidity increases as density increases. It is also able to determine loosely, how the higher quality is more acidic.
This plot shows how the alcohol content changes with quality, mostly high content wines are promenent in quality 8 while the opposite is true for quality 3. There is also a small down shift towards volatile acidity as quality increases, I suspect to lower the vinger taste.
This dataset contains various chemicals commonly found in wine and also the quality rating for each wine. Starting off, I knew I wanted to answer the question “what chemicals affect the quality of the wine most?”. Through my intial analysis I found that acidity and alcohol % had pretty good distributions. Through further analysis I found that higher quality wines have a greater alcohol percent. Now, that does not mean higher alcohol percent will necesarily be of higher quality. Examining the coorelation plot shows a value of 0.476 for quality and alcohol which is the highest compared to all other quality paired correlations. I also noticed how acidiy was affected by density and that was a surprise since I did not think density had any strong relationships. I was able to determine that higher quality wines seem to show affinity for lower density and higher acidity, which makes sense since the alcohol % is increasing and moving further away from the density of water (1). These are most of the relationships that I have observered to influence quality but, quality is very subjective. I feel that a quality of 8 may be a quality of 3 to me or 10 to someone else, since it depends on the subtleties of wine that are not always picked up by everyone.
For future work I think it would be interesting to compare the quality of wine to the price. I would graph the quality on y-axis and price on x-axis to show the general relationship. Then, I would create a seperate graph for each quality and see how the price differs and try to find any correlation to other variables (acidity, sugar, density) that may also affect the price for each quality.